{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Gaussian Processes in Shogun"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "By Heiko Strathmann - <a href=\"mailto:heiko.strathmann@gmail.com\">heiko.strathmann@gmail.com</a> - <a href=\"https://github.com/karlnapf\">github.com/karlnapf</a> - <a href=\"http://herrstrathmann.de\">herrstrathmann.de</a>.\n",
      "\n",
      "\n",
      "Based on the GP framework of the <a href=\"http://www.google-melange.com/gsoc/project/google/gsoc2013/votjak/8001\">Google summer of code 2013 project</a> of Roman Votyakov - <a href=\"mailto:votjakovr@gmail.com\">votjakovr@gmail.com</a> - <a href=\"https://github.com/votjakovr\">github.com/votjakovr</a>, and the <a href=\"http://www.google-melange.com/gsoc/project/google/gsoc2012/walke434/39001\">Google summer of code 2012 project</a> of Jacob Walker - <a href=\"mailto:walke434@gmail.com\">walke434@gmail.com</a> - <a href=\"https://github.com/puffin444\">github.com/puffin444</a> "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This notebook is about Bayesian regression and classification models with Gaussian Process (GP) priors in Shogun. After providing a semi-formal introductions, we illustrate how to efficiently train them, use them for predictions, and automatically learn parameters ."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# import all shogun classes\n",
      "from modshogun import *"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Some Formal Background (Skip if you just want code examples)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This notebook is about Bayesian regression models with Gaussian Process priors. A Gaussian Process (GP) over real valued functions on some domain $\\mathcal{X}$,  $f(\\mathbf{x}):\\mathcal{X} \\rightarrow \\mathbb{R}$, written as\n",
      "\n",
      "$\\mathcal{GP}(m(\\mathbf{x}), k(\\mathbf{x},\\mathbf{x}')),$\n",
      "\n",
      "defines a distribution over real valued functions with mean value $m(\\mathbf{x})=\\mathbb{E}[f(\\mathbf{x})]$ and inter-function covariance $k(\\mathbf{x},\\mathbf{x}')=\\mathbb{E}[(f(\\mathbf{x})-m(\\mathbf{x}))^T(f(\\mathbf{x})-m(\\mathbf{x})]$. This intuitively means tha the function value at any point $\\mathbf{x}$, i.e., $f(\\mathbf{x})$ is a random variable with mean $m(\\mathbf{x})$; if you take the average of infinitely many functions from the Gaussian Process, and evaluate them at $\\mathbf{x}$, you will get this value. Similarily, the function values at two different points $\\mathbf{x}, \\mathbf{x}'$ have covariance $k(\\mathbf{x}, \\mathbf{x}')$. The formal definition is that Gaussian Process is a collection of random variables (may be infinite) of which any finite subset have a joint Gaussian distribution.\n",
      "\n",
      "One can model data with Gaussian Processes via defining a joint distribution over\n",
      "\n",
      "* $n$ data (labels in Shogun) $\\mathbf{y}\\in \\mathcal{Y}^n$, from a $n$ dimensional continous (regression) or discrete (classification) space. These data correspond to $n$ covariates $\\mathbf{x}_i\\in\\mathcal{X}$ (features in Shogun) from the input space $\\mathcal{X}$.\n",
      "* Hyper-parameters $\\boldsymbol{\\theta}$ which depend on the used model (details follow).\n",
      "* Latent Gaussian variables $\\mathbf{f}\\in\\mathbb{R}^n$, coming from a GP, i.e., they have a joint Gaussian distribution. Every entry $f_i$ corresponds to the GP function $f(\\mathbf{x_i})$ evaluated at covariate $\\mathbf{x}_i$ for $1\\leq i \\leq n$. \n",
      "\n",
      "The joint  distribution takes the form\n",
      "\n",
      "$p(\\mathbf{f},\\mathbf{y},\\theta)=p(\\boldsymbol{\\theta})p(\\mathbf{f}|\\boldsymbol{\\theta})p(\\mathbf{y}|\\mathbf{f}),$\n",
      "\n",
      "where $\\mathbf{f}|\\boldsymbol{\\theta}\\sim\\mathcal{N}(\\mathbf{m}_\\theta, \\mathbf{C}_\\theta)$ is the joint Gaussian distribution  for the GP variables, with mean $\\mathbf{m}_\\boldsymbol{\\theta}$ and covariance $\\mathbf{C}_\\theta$. The $(i,j)$-th entriy of $\\mathbf{C}_\\boldsymbol{\\theta}$ is given by the covariance or kernel between the $(i,j)$-th  covariates $k(\\mathbf{x}_i, \\mathbf{x}_j)$. Examples for kernel and mean functions are given later in the notebook.\n",
      "\n",
      "Mean and covariance are both depending on hyper-parameters coming from a prior distribution $\\boldsymbol{\\theta}\\sim p(\\boldsymbol{\\theta})$. The data itself $\\mathbf{y}\\in \\mathcal{Y}^n$ (no assumptions on $\\mathcal{Y}$ for now) is modelled by a likelihood function $p(\\mathbf{y}|\\mathbf{f})$, which gives the probability of the data $\\mathbf{y}$ given a state of the latent Gaussian variables $\\mathbf{f}$, i.e. $p(\\mathbf{y}|\\mathbf{f}):\\mathcal{Y}^n\\rightarrow [0,1]$.\n",
      "\n",
      "In order to do inference for a new, unseen covariate $\\mathbf{x}^*\\in\\mathcal{X}$, i.e., predicting its label $y^*\\in\\mathcal{Y}$ or in particular computing the predictive distribution for that label, we have integrate over the posterior over the latent Gaussian variables (assume fixed $\\boldsymbol{\\theta}$ for now, which means you can just ignore the symbol in the following if you want),\n",
      "\n",
      "$p(y^*|\\mathbf{y}, \\boldsymbol{\\theta})=\\int p(\\mathbf{y}^*|\\mathbf{f})p(\\mathbf{f}|\\mathbf{y}, \\boldsymbol{\\theta})d\\mathbf{f}|\\boldsymbol{\\theta}.$\n",
      "\n",
      "This posterior, $p(\\mathbf{f}|\\mathbf{y}, \\boldsymbol{\\theta})$, can be obtained using standard <a href=\"http://en.wikipedia.org/wiki/Bayes'_theorem\">Bayes-Rule</a> as\n",
      "\n",
      "$p(\\mathbf{f}|\\mathbf{y},\\boldsymbol{\\theta})=\\frac{p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f}|\\boldsymbol{\\theta})}{p(\\mathbf{y}|\\boldsymbol{\\theta})},$\n",
      "\n",
      "with the so called *evidence* or *marginal likelihood* $p(\\mathbf{y}|\\boldsymbol{\\theta})$ given as another integral over the prior over the latent Gaussian variables\n",
      "\n",
      "$p(\\mathbf{y}|\\boldsymbol{\\theta})=\\int p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f}|\\boldsymbol{\\theta})d\\mathbf{f}|\\boldsymbol{\\theta}$.\n",
      "\n",
      "In order to solve the above integrals, Shogun offers a variety of approximations. Don't worry, you will not have to deal with these nasty integrals on your own, but everything is hidden within Shogun. Though, if you like to play with these objects, you will be able to compute only parts.\n",
      "\n",
      "Note that in the above description, we did not make any assumptions on the input space $\\mathcal{X}$. As long as you define mean and covariance functions, and a likelihood, your data can have *any* form you like. Shogun in fact is able to deal with  standard <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDenseFeatures.html\">dense numerical data</a>, <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSparseFeatures.html\"> sparse data</a>, and <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CStringFeatures.html\">strings of *any* type</a>, and many more *out of the box*. We will provide some examples below.\n",
      "\n",
      "To gain some intuition how these latent Gaussian variables behave, and how to model data with them, see the regression part of this notebook."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Non-Linear Bayesian Regression"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Bayesian regression with Gaussian Processes is among the most fundamental applications of latent Gaussian models. As usual, the oberved data come from a contintous space, i.e. $\\mathbf{y}\\in\\mathbb{R}^n$, which is represented in the Shogun class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CRegressionLabels.html\">CRegressionLabels</a>. We assume that these observations come from some distribution $p(\\mathbf{y}|\\mathbf{f)}$ that is based on a fixed state of latent Gaussian response variables $\\mathbf{f}\\in\\mathbb{R}^n$. In fact, we assume that the true model *is* the latent Gaussian response variable (which defined a *distribution* over functions; plus some Gaussian observation noise which is modelled by the likelihood as\n",
      "\n",
      " $p(\\mathbf{y}|\\mathbf{f})=\\mathcal{N}(\\mathbf{f},\\sigma^2\\mathbf{I})$\n",
      " \n",
      " This simple likelihood is implemented in the Shogun class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianLikelihood.html\">CGaussianLikelihood</a>. It is the well known bell curve. Below, we plot the likelihood as a function of $\\mathbf{y}$, for $n=1$."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# plot likelihood for three different noise lebels $\\sigma$ (which is not yet squared)\n",
      "sigmas=array([0.5,1,2])\n",
      "\n",
      "# likelihood instance\n",
      "lik=GaussianLikelihood()\n",
      "\n",
      "# A set of labels to consider\n",
      "lab=RegressionLabels(linspace(-4.0,4.0, 200))\n",
      "\n",
      "# A single 1D Gaussian response function, repeated once for each label\n",
      "# this avoids doing a loop in python which would be slow\n",
      "F=zeros(lab.get_num_labels())\n",
      "\n",
      "# plot likelihood for all observations noise levels\n",
      "figure(figsize=(12, 4))\n",
      "for sigma in sigmas:\n",
      "    # set observation noise, this is squared internally\n",
      "    lik.set_sigma(sigma)\n",
      "    \n",
      "    # compute log-likelihood for all labels\n",
      "    log_liks=lik.get_log_probability_f(lab, F)\n",
      "    \n",
      "    # plot likelihood functions, exponentiate since they were computed in log-domain\n",
      "    plot(lab.get_labels(), exp(log_liks))\n",
      "    \n",
      "ylabel(\"$p(y_i|f_i)$\")\n",
      "xlabel(\"$y_i$\")\n",
      "title(\"Regression Likelihoods for different observation noise levels\")\n",
      "_=legend([\"sigma=$%.1f$\" % sigma for sigma in sigmas])\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Apart from its apealling form, this curve has the nice property of given rise to *analytical* solutions to the required integrals. Recall these are given by\n",
      "\n",
      "$p(y^*|\\mathbf{y}, \\boldsymbol{\\theta})=\\int p(\\mathbf{y}^*|\\mathbf{f})p(\\mathbf{f}|\\mathbf{y}, \\boldsymbol{\\theta})d\\mathbf{f}|\\boldsymbol{\\theta},$\n",
      "\n",
      "and\n",
      "\n",
      "$p(\\mathbf{y}|\\boldsymbol{\\theta})=\\int p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f}|\\boldsymbol{\\theta})d\\mathbf{f}|\\boldsymbol{\\theta}$.\n",
      "\n",
      "Since all involved elements, the likelihood $p(\\mathbf{y}|\\mathbf{f})$, the GP prior $p(\\mathbf{f}|\\boldsymbol{\\theta})$ are Gaussian, the same follows for the GP posterior $p(\\mathbf{f}|\\mathbf{y}, \\boldsymbol{\\theta})$, and the marginal likelihood $p(\\mathbf{y}|\\boldsymbol{\\theta})$. Therefore, we just need to sit down with pen and paper to derive the resulting forms of the Gaussian distributions of these objects (see references). Luckily, everything is already implemented in Shogun."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In order to get some intuition about Gaussian Processes in general, let us first have a look at these latent Gaussian variables, which define a probability distribution over real values functions $f(\\mathbf{x}):\\mathcal{X} \\rightarrow \\mathbb{R}$, where in the regression case, $\\mathcal{X}=\\mathbb{R}$.\n",
      "\n",
      "As mentioned above, the joint distribution of a finite number (say $n$) of variables $\\mathbf{f}\\in\\mathbb{R}^n$ from a Gaussian Process $\\mathcal{GP}(m(\\mathbf{x}), k(\\mathbf{x},\\mathbf{x}'))$, takes the form\n",
      "\n",
      "$\\mathbf{f}|\\boldsymbol{\\theta}\\sim\\mathcal{N}(\\mathbf{m}_\\theta, \\mathbf{C}_\\theta),$\n",
      "\n",
      "where $\\mathbf{m}_\\theta$ is the mean function's mean and $\\mathbf{C}_\\theta$ is the pairwise covariance or kernel matrix of the input covariates $\\mathbf{x}_i$. This means, we can easily sample function realisations $\\mathbf{f}^{(j)}$ from the Gaussian Process, and more important, visualise them.\n",
      "\n",
      "To this end, let us consider the well-known and often used *Gaussian Kernel* or *squared exponential covariance*, which is implemented in the Shogun class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianKernel.html\">CGaussianKernel</a> in the parametric from (note that there are other forms in the literature)\n",
      "\n",
      "$ k(\\mathbf{x}, \\mathbf{x}')=\\exp\\left( -\\frac{||\\mathbf{x}-\\mathbf{x}'||_2^2}{\\tau}\\right),$\n",
      "\n",
      "where $\\tau$ is a hyper-parameter of the kernel. We will also use the constant <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CZeroMean.html\">CZeroMean</a> mean function, which is suitable if the data's mean is zero (which can be achieved via removing it).\n",
      "\n",
      "Let us consider some toy regression data in the form of a sine wave, which is observed at random points with some observations noise."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def generate_regression_toy_data(n=50, n_test=100, x_range=15, x_range_test=20, noise_var=0.4):\n",
      "    # training and test sine wave, test one has more points\n",
      "    X_train = rand(n)*x_range\n",
      "    X_test = linspace(0,x_range_test, 500)\n",
      "    \n",
      "    # add noise to training observations\n",
      "    y_test = sin(X_test)\n",
      "    y_train = sin(X_train)+random.randn(n)*noise_var\n",
      "    \n",
      "    return X_train, y_train, X_test, y_test"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X_train, y_train, X_test, y_test = generate_regression_toy_data()\n",
      "figure(figsize=(16,4))\n",
      "plot(X_train, y_train, 'ro')\n",
      "plot(X_test, y_test)\n",
      "legend([\"Noisy observations\", \"True model\"])\n",
      "title(\"One-Dimensional Toy Regression Data\")\n",
      "xlabel(\"$\\mathbf{x}$\")\n",
      "_=ylabel(\"$\\mathbf{y}$\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we compute the kernel matrix $\\mathbf{C}_\\boldsymbol{\\theta}$ using the <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianKernel.html\">CGaussianKernel</a> with hyperparameter $\\boldsymbol{\\theta}=\\{\\tau\\}$ with a few differnt values. Note that in Gaussian Processes, kernels usually have a scaling parameter. We skip this one for now and cover it later."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# bring data into shogun representation (features are 2d-arrays, organised as column vectors)\n",
      "feats_train=RealFeatures(X_train.reshape(1,len(X_train)))\n",
      "feats_test=RealFeatures(X_test.reshape(1,len(X_test)))\n",
      "labels_train=RegressionLabels(y_train)\n",
      "\n",
      "# compute covariances for different kernel parameters\n",
      "taus=asarray([.1,4.,32.])\n",
      "Cs=zeros(((len(X_train), len(X_train), len(taus))))\n",
      "for i in range(len(taus)):\n",
      "    # compute unscalled kernel matrix (first parameter is maximum size in memory and not very important)\n",
      "    kernel=GaussianKernel(10, taus[i])\n",
      "    kernel.init(feats_train, feats_train)\n",
      "    Cs[:,:,i]=kernel.get_kernel_matrix()\n",
      "\n",
      "\n",
      "# plot\n",
      "figure(figsize=(16,5))\n",
      "for i in range(len(taus)):\n",
      "    subplot(1,len(taus),i+1)\n",
      "    imshow(Cs[:,:,i], interpolation=\"nearest\")\n",
      "    xlabel(\"Covariate index\")\n",
      "    ylabel(\"Covariate index\")\n",
      "    _=title(\"tau=%.1f\" % taus[i])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This matrix, as any kernel or covariance matrix, is positive semi-definite and symmetric. It can be viewed as a similarity matrix. Here, elements on the diagonal (corresponding to $\\mathbf{x}=\\mathbf{x}'$) have largest similarity. For increasing kernel bandwidth $\\tau$, more and more elements are similar. This matrix fully specifies a distribution over functions $f(\\mathbf{x}):\\mathcal{X}\\rightarrow\\mathbb{R}$ over a finite set of latent Gaussian variables $\\mathbf{f}$, which we can sample from and plot. To this end, we use the Shogun class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CStatistics.html\">CStatistics</a>, which offers a method to sample from multivariate Gaussians."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "figure(figsize=(16,5))\n",
      "suptitle(\"Random Samples from GP prior\")\n",
      "for i in range(len(taus)):\n",
      "    subplot(1,len(taus),i+1)\n",
      "    \n",
      "    # sample a bunch of latent functions from the Gaussian Process\n",
      "    # note these vectors are stored row-wise\n",
      "    F=Statistics.sample_from_gaussian(zeros(len(X_train)), Cs[:,:,i], 3)\n",
      "    \n",
      "    for j in range(len(F)):\n",
      "        # sort points to connect the dots with lines\n",
      "        sorted_idx=X_train.argsort()\n",
      "\n",
      "        plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)\n",
      "    \n",
      "    xlabel(\"$\\mathbf{x}_i$\")\n",
      "    ylabel(\"$f(\\mathbf{x}_i)$\")\n",
      "    _=title(\"tau=%.1f\" % taus[i])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note how the functions are exactly evaluated at the training covariates $\\mathbf{x}_i$ which are randomly distributed on the x-axis. Even though these points do not visualise the full functions (we can only evaluate them at a *finite* number of points, but we connected the points with lines to make it more clear), this reveils that larger values of the kernel bandwidth $\\tau$ lead to smoother latent Gaussian functions.\n",
      "\n",
      "In the above plots all functions are equally possible. That is, the prior of the latent Gaussian variables $\\mathbf{f}|\\boldsymbol{\\theta}$ does not favour any particular function setups. Computing the posterior given our training data, the distribution ober $\\mathbf{f}|\\mathbf{y},\\boldsymbol{\\theta}$ then corresponds to restricting the above distribution over functions to those that explain the training data (up to observation noise). We will now use the Shogun class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CExactInferenceMethod.html\">CExactInferenceMethod</a> to do exactly this. The class is the general basis of exact GP regression in Shogun. We have to define all parts of the Gaussian Process for the inference method."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "figure(figsize=(16,5))\n",
      "suptitle(\"Random Samples from GP posterior\")\n",
      "for i in range(len(taus)):\n",
      "    subplot(1,len(taus),i+1)\n",
      "\n",
      "    # create inference method instance with very small observation noise to make \n",
      "    inf=ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())\n",
      "    \n",
      "    C_post=inf.get_posterior_covariance()\n",
      "    m_post=inf.get_posterior_mean()\n",
      "\n",
      "    # sample a bunch of latent functions from the Gaussian Process\n",
      "    # note these vectors are stored row-wise\n",
      "    F=Statistics.sample_from_gaussian(m_post, C_post, 5)\n",
      "    \n",
      "    for j in range(len(F)):\n",
      "        # sort points to connect the dots with lines\n",
      "        sorted_idx=sorted(range(len(X_train)),key=lambda x:X_train[x])\n",
      "        plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)\n",
      "        plot(X_train, y_train, 'r*')\n",
      "    \n",
      "    xlabel(\"$\\mathbf{x}_i$\")\n",
      "    ylabel(\"$f(\\mathbf{x}_i)$\")\n",
      "    _=title(\"tau=%.1f\" % taus[i])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note how the above function samples are constrained to go through our training data labels (up to observation noise), as much as their smoothness allows them. In fact, these are already samples from the predictive distribution, which gives a probability for a label $\\mathbf{y}^*$ for any covariate $\\mathbf{x}^*$. These distributions are Gaussian (!), nice to look at and extremely useful to understand the GP's underlying model. Let's plot them. We finally use the Shogun class  <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianProcessRegression.html\">CGaussianProcessRegression</a> to represent the whole GP under an interface to perform inference with. In addition, we use the helper class class  <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianDistribution.html\">CGaussianDistribution</a> to evaluate the log-likelihood for every test point's $\\mathbf{x}^*_j$ value $\\mathbf{y}_j^*$."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# helper function that plots predictive distribution and data\n",
      "def plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances):\n",
      "    # evaluate predictive distribution in this range of y-values and preallocate predictive distribution\n",
      "    y_values=linspace(-3,3)\n",
      "    D=zeros((len(y_values), len(X_test)))\n",
      "    \n",
      "    # evaluate normal distribution at every prediction point (column)\n",
      "    for j in range(shape(D)[1]):\n",
      "        # create gaussian distributio instance, expects mean vector and covariance matrix, reshape\n",
      "        gauss=GaussianDistribution(array(means[j]).reshape(1,), array(variances[j]).reshape(1,1))\n",
      "    \n",
      "        # evaluate predictive distribution for test point, method expects matrix\n",
      "        D[:,j]=exp(gauss.log_pdf_multiple(y_values.reshape(1,len(y_values))))\n",
      "    \n",
      "    pcolor(X_test,y_values,D)\n",
      "    colorbar()\n",
      "    contour(X_test,y_values,D)\n",
      "    plot(X_test,y_test, 'b', linewidth=3)\n",
      "    plot(X_test,means, 'm--', linewidth=3)\n",
      "    plot(X_train, y_train, 'ro')\n",
      "    legend([\"Truth\", \"Prediction\", \"Data\"])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "figure(figsize=(18,10))\n",
      "suptitle(\"GP inference for different kernel widths\")\n",
      "for i in range(len(taus)):\n",
      "    subplot(len(taus),1,i+1)\n",
      "    \n",
      "    # create GP instance using inference method and train\n",
      "    # use Shogun objects from above\n",
      "    inf.set_kernel(GaussianKernel(10,taus[i]))\n",
      "    gp=GaussianProcessRegression(inf)\n",
      "    gp.train()\n",
      "    \n",
      "    # predict labels for all test data (note that this produces the same as the below mean vector)\n",
      "    means = gp.apply(feats_test)\n",
      "    \n",
      "    # extract means and variance of predictive distribution for all test points\n",
      "    means = gp.get_mean_vector(feats_test)\n",
      "    variances = gp.get_variance_vector(feats_test)\n",
      "    \n",
      "    # note: y_predicted == means\n",
      "    \n",
      "    # plot predictive distribution and training data\n",
      "    plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)\n",
      "    _=title(\"tau=%.1f\" % taus[i])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The question now is: Which set of hyper-parameters $\\boldsymbol{\\theta}=\\{\\tau, \\gamma, \\sigma\\}$ to take, where $\\gamma$ is the kernel scaling (which we omitted so far), and $\\sigma$ is the observation noise (which we left at its defaults value of one)? The question of model-selection will be handled in a bit more depth in the binary classification case. For now we just show code how to do it as a black box. See below for explanations."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# re-create inference method and GP instance to start from scratch, use other Shogun structures from above\n",
      "inf = ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())\n",
      "gp = GaussianProcessRegression(inf)\n",
      "\n",
      "# evaluate our inference method for its derivatives\n",
      "grad = GradientEvaluation(gp, feats_train, labels_train, GradientCriterion(), False)\n",
      "grad.set_function(inf)\n",
      "\n",
      "# handles all of the above structures in memory\n",
      "grad_search = GradientModelSelection(grad)\n",
      "\n",
      "# search for best parameters and store them\n",
      "best_combination = grad_search.select_model()\n",
      "\n",
      "# apply best parameters to GP, train\n",
      "best_combination.apply_to_machine(gp)\n",
      "\n",
      "# we have to \"cast\" objects to the specific kernel interface we used (soon to be easier)\n",
      "best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()\n",
      "best_scale=inf.get_scale()\n",
      "best_sigma=GaussianLikelihood.obtain_from_generic(inf.get_model()).get_sigma()\n",
      "print \"Selected tau (kernel bandwidth):\", best_width\n",
      "print \"Selected gamma (kernel scaling):\", best_scale\n",
      "print \"Selected sigma (observation noise):\", best_sigma"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can output the best parameters and plot the predictive distribution for those."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# train gp\n",
      "gp.train()\n",
      "\n",
      "# extract means and variance of predictive distribution for all test points\n",
      "means = gp.get_mean_vector(feats_test)\n",
      "variances = gp.get_variance_vector(feats_test)\n",
      "\n",
      "# plot predictive distribution\n",
      "figure(figsize=(18,5))\n",
      "plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)\n",
      "_=title(\"Maximum Likelihood II based inference\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now the predictive distribution is very close to the true data generating process."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Non-Linear, Binary Bayesian Classification"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In binary classification, the observed data comes from a space of discrete, binary labels, i.e. $\\mathbf{y}\\in\\mathcal{Y}^n=\\{-1,+1\\}^n$, which are represented via the Shogun class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CBinaryLabels.html\">CBinaryLabels</a>. To model these observations with a GP, we need a likelihood function $p(\\mathbf{y}|\\mathbf{f})$ that maps a  set of such discrete observations to a probability, given a fixed response $\\mathbf{f}$ of the Gaussian Process.\n",
      "\n",
      "In regression, this way straight-forward, as we could simply use the response variable $\\mathbf{f}$ itself, plus some Gaussian noise, which gave rise to a probability distribution. However, now that the $\\mathbf{y}$ are discrete, we cannot do the same thing. We rather need a function that squashes the Gaussian response variable itself to a probability, given some data. This is a common problem in Machine Learning and Statistics and is usually done with some sort of *Sigmoid* function of the form $\\sigma:\\mathbb{R}\\rightarrow[0,1]$. One popular choicefor such a function is the *Logit* likelihood, given by\n",
      "\n",
      "$p(\\mathbf{y}|\\mathbf{f})=\\prod_{i=1}^n p(y_i|f_i)=\\prod_{i=1}^n \\frac{1}{1-\\exp(-y_i f_i)}.$\n",
      "\n",
      "This likelihood is implemented in Shogun under <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLogitLikelihood.html\">CLogitLikelihood</a> and using it is sometimes refered to as *logistic regression*. Using it with GPs results in non-linear Bayesian logistic regression. We can easily use the class to illustrate the sigmoid function for a 1D example and a fixed data point with label $+1$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# two classification likelihoods in Shogun\n",
      "logit=LogitLikelihood()\n",
      "probit=ProbitLikelihood()\n",
      "\n",
      "# A couple of Gaussian response functions, 1-dimensional here\n",
      "F=linspace(-5.0,5.0)\n",
      "\n",
      "# Single observation label with +1\n",
      "lab=BinaryLabels(array([1.0]))\n",
      "\n",
      "# compute log-likelihood for all values in F\n",
      "log_liks_logit=zeros(len(F))\n",
      "log_liks_probit=zeros(len(F))\n",
      "for i in range(len(F)):\n",
      "    # Shogun expects a 1D array for f, not a single number\n",
      "    f=array(F[i]).reshape(1,)\n",
      "    log_liks_logit[i]=logit.get_log_probability_f(lab, f)\n",
      "    log_liks_probit[i]=probit.get_log_probability_f(lab, f)\n",
      "    \n",
      "    \n",
      "# in fact, loops are slow and Shogun offers a method to compute the likelihood for many f. Much faster!\n",
      "log_liks_logit=logit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))\n",
      "log_liks_probit=probit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))\n",
      "\n",
      "# plot the sigmoid functions, note that Shogun computes it in log-domain, so we have to exponentiate\n",
      "figure(figsize=(12, 4))\n",
      "plot(F, exp(log_liks_logit))\n",
      "plot(F, exp(log_liks_probit))\n",
      "ylabel(\"$p(y_i|f_i)$\")\n",
      "xlabel(\"$f_i$\")\n",
      "title(\"Classification Likelihoods\")\n",
      "_=legend([\"Logit\", \"Probit\"])\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note how the logit function maps any input value to $[0,1]$ in a continuous way. The other plot above is for another classification likelihood is implemented in Shogun is the Gaussian CDF function\n",
      "\n",
      "$p(\\mathbf{y}|\\mathbf{f})=\\prod_{i=1}^n p(y_i|f_i)=\\prod_{i=1}^n \\Phi(y_i f_i),$\n",
      "\n",
      "where $\\Phi:\\mathbb{R}\\rightarrow [0,1]$ is the <a href=\"http://en.wikipedia.org/wiki/Cumulative_distribution_function\">cumulative distribution function</a> (CDF) of the standard Gaussian distribution $\\mathcal{N}(0,1)$. It is implemented in the Shogun class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CProbitLikelihood.html\">CProbitLikelihood</a> and using it is refered to as *probit regression*. While the Gaussian CDF has some convinient properties for integrating over it (and thus allowing some different modelling decisions), it doesn not really matter what you use in Shogun in most cases. However, for the sake of completeness, it is also potted above, being very similar to the logit likelihood."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "TODO: Show a function squashed through the logit likelihood"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Recall that in order to do inference, we need to solve two integrals (in addition to the Bayes rule, see above)\n",
      "\n",
      "$p(y^*|\\mathbf{y}, \\boldsymbol{\\theta})=\\int p(\\mathbf{y}^*|\\mathbf{f})p(\\mathbf{f}|\\mathbf{y}, \\boldsymbol{\\theta})d\\mathbf{f}|\\boldsymbol{\\theta},$\n",
      "\n",
      "and\n",
      "\n",
      "$p(\\mathbf{y}|\\boldsymbol{\\theta})=\\int p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f}|\\boldsymbol{\\theta})d\\mathbf{f}|\\boldsymbol{\\theta}$.\n",
      "\n",
      "In classification, the second integral is not available in closed form since it is the convolution of a Gaussian, $p(\\mathbf{f}|\\boldsymbol{\\theta})$, and a non-Gaussian, $p(\\mathbf{y}|\\mathbf{f})$, distribution. Therefore, we have to rely on approximations in order to compute and integrate over the posterior $p(\\mathbf{f}|\\mathbf{y},\\boldsymbol{\\theta})$. Shogun offers various standard methods from the literature to deal with this problem, including the Laplace approximation (<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLaplacianInferenceMethod.html\">CLaplacianInferenceMethod</a>), Expectation Propagation (<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CEPInferenceMethod.html\">CEPInferenceMethod</a>) for inference and evaluatiing the marginal likelihood. These two approximations give rise to a Gaussian posterior  $p(\\mathbf{f}|\\mathbf{y},\\boldsymbol{\\theta})$, which can then be easily computed and integrated over (all this is done by Shogun for you).\n",
      "\n",
      "While the Laplace approximation is quite fast, EP usually has a better accuracy, in particular if one is not just interetsed in binary decisions but also in certainty values for these predictions. Go for Laplace if interested in binary decisions, and for EP otherwise.\n",
      "\n",
      "TODO, add references to inference methods."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We will now give an example on how to do GP inference for binary classification in Shogun on some toy data. For that, we will first definea function to generate a classical non-linear classification problem."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def generate_classification_toy_data(n_train=100, mean_a=asarray([0, 0]), std_dev_a=1.0, mean_b=3, std_dev_b=0.5):\n",
      "\n",
      "    # positive examples are distributed normally\n",
      "    X1 = (random.randn(n_train, 2)*std_dev_a+mean_a).T\n",
      "\n",
      "    # negative examples have a \"ring\"-like form\n",
      "    r = random.randn(n_train)*std_dev_b+mean_b\n",
      "    angle = random.randn(n_train)*2*pi\n",
      "    X2 = array([r*cos(angle)+mean_a[0], r*sin(angle)+mean_a[1]])\n",
      "\n",
      "    # stack positive and negative examples in a single array\n",
      "    X_train = hstack((X1,X2))\n",
      "\n",
      "    # label positive examples with +1, negative with -1\n",
      "    y_train = zeros(n_train*2)\n",
      "    y_train[:n_train] = 1\n",
      "    y_train[n_train:] = -1\n",
      "\n",
      "    return X_train, y_train\n",
      "\n",
      "def plot_binary_data(X_train, y_train):\n",
      "    plot(X_train[0, argwhere(y_train == 1)], X_train[1, argwhere(y_train == 1)], 'ro')\n",
      "    plot(X_train[0, argwhere(y_train == -1)], X_train[1, argwhere(y_train == -1)], 'bo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X_train, y_train=generate_classification_toy_data()\n",
      "plot_binary_data(X_train, y_train)\n",
      "_=title(\"2D Toy classification problem\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We will now pass this data into Shogun representation, and use the standard Gaussian kernel (or squared exponential covariance function (<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianKernel.html\">CGaussianKernel</a>)) and the Laplace approximation to obtain a decision boundary for the two classes. You can easily exchange different likelihood models and inference methods."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# for building combinations of arrays\n",
      "from itertools import product\n",
      "\n",
      "# convert training data into Shogun representation\n",
      "train_features = RealFeatures(X_train)\n",
      "train_labels = BinaryLabels(y_train)\n",
      "\n",
      "# generate all pairs in 2d range of testing data (full space), discretisation resultion is n_test\n",
      "n_test=50\n",
      "x1 = linspace(X_train[0,:].min()-1, X_train[0,:].max()+1, n_test)\n",
      "x2 = linspace(X_train[1,:].min()-1, X_train[1,:].max()+1, n_test)\n",
      "X_test = asarray(list(product(x1, x2))).T\n",
      "\n",
      "# convert testing features into Shogun representation\n",
      "test_features = RealFeatures(X_test)\n",
      "\n",
      "# create Gaussian kernel with width = 2.0\n",
      "kernel = GaussianKernel(10, 2)\n",
      "\n",
      "# create zero mean function\n",
      "zero_mean = ZeroMean()\n",
      "\n",
      "# you can easily switch between probit and logit likelihood models\n",
      "# by uncommenting/commenting the following lines:\n",
      "\n",
      "# create probit likelihood model\n",
      "# lik = ProbitLikelihood()\n",
      "\n",
      "# create logit likelihood model\n",
      "lik = LogitLikelihood()\n",
      "\n",
      "# you can easily switch between Laplace and EP approximation by\n",
      "# uncommenting/commenting the following lines:\n",
      "\n",
      "# specify Laplace approximation inference method\n",
      "#inf = LaplacianInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)\n",
      "\n",
      "# specify EP approximation inference method\n",
      "inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)\n",
      "\n",
      "# create and train GP classifier, which uses Laplace approximation\n",
      "gp = GaussianProcessBinaryClassification(inf)\n",
      "gp.train()\n",
      "\n",
      "test_labels=gp.apply(test_features)\n",
      "\n",
      "# plot data and decision boundary\n",
      "plot_binary_data(X_train, y_train)\n",
      "pcolor(x1, x2, test_labels.get_labels().reshape(n_test, n_test))\n",
      "_=title('Decision boundary')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is already quite nice. The nice thing about Gaussian Processes now is that they are Bayesian, which means that have a full predictive distribution, i.e., we can plot the probability for a point belonging to a class. These can be obtained via the interface of <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianProcessBinaryClassification.html\">CGaussianProcessBinaryClassification</a>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# obtain probabilities for \n",
      "p_test = gp.get_probabilities(test_features)\n",
      "\n",
      "# create figure\n",
      "title('Training data, predictive probability and decision boundary')\n",
      "\n",
      "# plot training data\n",
      "plot_binary_data(X_train, y_train)\n",
      "\n",
      "# plot decision boundary\n",
      "contour(x1, x2, reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))\n",
      "\n",
      "# plot probabilities\n",
      "pcolor(x1, x2, p_test.reshape(n_test, n_test))\n",
      "_=colorbar()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you are interested in the marginal likelihood $p(\\mathbf{y}|\\boldsymbol{\\theta})$, for example for the sake of comparing different model parameters $\\boldsymbol{\\theta}$ (more in model-selection later), it is very easy to compute it via the interface of <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CInferenceMethod.html\">CInferenceMethod</a>, i.e., every inference method in Shogun can do that. It is even possible to obtain the mean and covariance of the Gaussian approximation to the posterior $p(\\mathbf{f}|\\mathbf{y})$ using Shogun. In the following, we plot the marginal likelihood under the EP inference method (more accurate approximation) as a one dimensional function of the kernel width."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# generate some non-negative kernel widths\n",
      "widths=2**linspace(-5,6,20)\n",
      "\n",
      "# compute marginal likelihood under Laplace apprixmation for every width\n",
      "# use Shogun objects from above\n",
      "marginal_likelihoods=zeros(len(widths))\n",
      "for i in range(len(widths)):\n",
      "    # note that GP training is automatically done/updated if a parameter is changed. No need to call train again\n",
      "    kernel.set_width(widths[i])\n",
      "    marginal_likelihoods[i]=-inf.get_negative_log_marginal_likelihood()\n",
      "    \n",
      "# plot marginal likelihoods as a function of kernel width\n",
      "plot(log2(widths), marginal_likelihoods)\n",
      "title(\"Log Marginal likelihood for different kernels\")\n",
      "xlabel(\"Kernel Width in log-scale\")\n",
      "_=ylabel(\"Log-Marginal Likelihood\")\n",
      "\n",
      "print \"Width with largest marginal likelihood:\", widths[marginal_likelihoods.argmax()]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This plot clearly shows that there is one kernel width (aka hyper-parameter element $\\theta$) for that the marginal likelihood is maximised. If one was interested in the single best parameter, the above concept can be used to learn the best hyper-parameters of the GP. In fact, this is possible in a very efficient way since we have a lot of information about the geometry of the marginal likelihood function, as for example its gradient: It turns out that for example the above function is smooth and we can use the usual optimisation techniques to find extrema. This is called *maximum likelihood II*. Let's have a closer look."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Excurs: Model-Selection with Gaussian Processes"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, let us have a look at the predictive distributions of some of the above kernel widths"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# again, use Shogun objects from above, but a few extremal widths\n",
      "widths_subset=array([widths[0], widths[marginal_likelihoods.argmax()], widths[len(widths)-1]])\n",
      "\n",
      "figure(figsize=(18, 5))\n",
      "for i in range(len(widths_subset)):\n",
      "    subplot(1,len(widths_subset),i+1)\n",
      "    kernel.set_width(widths_subset[i])\n",
      "    \n",
      "    # obtain and plot predictive distribution\n",
      "    p_test = gp.get_probabilities(test_features)\n",
      "    title_str=\"Width=%.2f, \" % widths_subset[i]\n",
      "    if i is 0:\n",
      "        title_str+=\"too complex, overfitting\"\n",
      "    elif i is 1:\n",
      "        title_str+=\"just right\"\n",
      "    else:\n",
      "        title_str+=\"too smooth, underfitting\"\n",
      "        \n",
      "    title(title_str)\n",
      "    plot_binary_data(X_train, y_train)\n",
      "    contour(x1, x2, reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))\n",
      "    pcolor(x1, x2, p_test.reshape(n_test, n_test))\n",
      "    _=colorbar()\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the above plots, it is quite clear that the maximum of the marginal likelihood corresponds to the best single setting of the parameters. To give some more intuition: The interpretation of the marginal likelihood\n",
      "\n",
      "$p(\\mathbf{y}|\\boldsymbol{\\theta})=\\int p(\\mathbf{y}|\\mathbf{f})p(\\mathbf{f}|\\boldsymbol{\\theta})d\\mathbf{f}|\\boldsymbol{\\theta}$\n",
      "\n",
      "is the probability of the data given the model parameters $\\boldsymbol{\\theta}$. Note that this is averaged over *all* possible configurations of the latent Gaussian variables $\\mathbf{f}|\\boldsymbol{\\theta}$  given a fixed configuration of parameters. However, since this is probability distribution, it has to integrate to $1$. This means that models that are too complex (and thus being able to explain too many different data configutations) and models that are too simple (and thus not able to explain the current data) give rise to a small marginal likelihood. Only when the model is just complex enough to explain the data well (but not more complex), the marginal likelihood is maximised. This is an implementation of a concept called <a href=\"http://en.wikipedia.org/wiki/Occam's_razor#Probability_theory_and_statistics\">Occam's razor</a>, and is a nice motivation why you should be Bayesian if you can -- overfitting doesn't happen that quickly.\n",
      "\n",
      "As mentioned before, Shogun is able to automagically learn all of the hyper-parameters $\\boldsymbol{\\theta}$ using gradient based optimisation on the marginal likelihood (whose derivatives are computed internally). To this is, we use the class <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGradientModelSelection.html\">CGradientModelSelection</a>. Note that we could also use <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGridSearchModelSelection.html\">CGridSearchModelSelection</a> to do a standard grid-search, such as is done for Support Vector Machines. However, this is highly ineffective, in particular when the number of parameters grows. In addition, order to evaluate parameter states, we have to use the classes <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGradientEvaluation.html\">CGradientEvaluation</a>, and <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1GradientCriterion.html\">GradientCriterion</a>, which is also much cheaper than the usual  <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CCrossValidation.html\">CCrossValidation</a>, since it just evaluates the gradient of the marginal likelihood rather than performing many training and testing runs. This is another very nice motivation for using Gaussian Processes: optimising parameters is much easier. In the following, we demonstrate how to select *all* parameters of the used model.  In Shogun, parameter configurations (corresponding to $\\boldsymbol{\\theta}$ are stored in instances of  <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CParameterCombination.html\">CParameterCombination</a>, which can be applied to machines.\n",
      "\n",
      "This approach is known as *maximum likelihood II* (the 2 is for the second level, averaging over all possible $\\mathbf{f}|\\boldsymbol{\\theta}$), or *evidence maximisation*."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# re-create inference method and GP instance to start from scratch, use other Shogun structures from above\n",
      "inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)\n",
      "gp = GaussianProcessBinaryClassification(inf)\n",
      "\n",
      "# evaluate our inference method for its derivatives\n",
      "grad = GradientEvaluation(gp, train_features, train_labels, GradientCriterion(), False)\n",
      "grad.set_function(inf)\n",
      "\n",
      "# handles all of the above structures in memory\n",
      "grad_search = GradientModelSelection(grad)\n",
      "\n",
      "# search for best parameters and store them\n",
      "best_combination = grad_search.select_model()\n",
      "\n",
      "# apply best parameters to GP\n",
      "best_combination.apply_to_machine(gp)\n",
      "\n",
      "# we have to \"cast\" objects to the specific kernel interface we used (soon to be easier)\n",
      "best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()\n",
      "best_scale=inf.get_scale()\n",
      "print \"Selected kernel bandwidth:\", best_width\n",
      "print \"Selected kernel scale:\", best_scale"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This now gives us a trained Gaussian Process with the best hyper-parameters. In the above setting, this is the s <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianKernel.html\">CGaussianKernel</a> bandwith, and its scale (which is stored in the GP itself since Shogun kernels do not support scalling). We can now again visualise the predictive distribution, and also output the best parameters."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# train gp\n",
      "gp.train()\n",
      "\n",
      "# visualise predictive distribution\n",
      "p_test = gp.get_probabilities(test_features)\n",
      "plot_binary_data(X_train, y_train)\n",
      "contour(x1, x2, reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))\n",
      "pcolor(x1, x2, p_test.reshape(n_test, n_test))\n",
      "_=colorbar()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note how nicely this predictive distribution matches the data generating distribution. Also note that the best kernel bandwidth is different to the one we saw in the above plot. This is caused by the different kernel scalling that was also learned automatically. The kernel scaling, roughly speaking, corresponds to the sharpness of the changes in the surface of the predictive likelihood. Since we have two hyper-parameters, we can plot the surface of the marginal likelihood as a function of both of them. This is sometimes interesting, for example when this surface has multiple maximum (corresponding to multiple \"best\" parameter settings), and thus might be useful for analysis. It is expensive however."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# parameter space, increase resolution if you want finer plots, takes long though\n",
      "resolution=5\n",
      "widths=2**linspace(-4,10,resolution)\n",
      "scales=2**linspace(-5,10,resolution)\n",
      "\n",
      "# re-create inference method and GP instance to start from scratch, use other Shogun structures from above\n",
      "inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)\n",
      "gp = GaussianProcessBinaryClassification(inf)\n",
      "\n",
      "# compute marginal likelihood for every parameter combination\n",
      "# use Shogun objects from above\n",
      "marginal_likelihoods=zeros((len(widths), len(scales)))\n",
      "for i in range(len(widths)):\n",
      "    for j in range(len(scales)):\n",
      "        kernel.set_width(widths[i])\n",
      "        inf.set_scale(scales[j])\n",
      "        marginal_likelihoods[i,j]=-inf.get_negative_log_marginal_likelihood()\n",
      "    \n",
      "# contour plot of marginal likelihood as a function of kernel width and scale\n",
      "contour(log2(widths), log2(scales), marginal_likelihoods)\n",
      "colorbar()\n",
      "xlabel(\"Kernel width (log-scale)\")\n",
      "ylabel(\"Kernel scale (log-scale)\")\n",
      "_=title(\"Log Marginal Likelihood\")\n",
      "\n",
      "# plot our found best parameters\n",
      "_=plot([log2(best_width)], [log2(best_scale)], 'r*', markersize=20)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Our found maximum nicely matches the result of the \"grid-search\". The take home message for this is: With Gaussian Processes, you neither need to do expensive brute force approaches to find best paramters (but you can use gradient descent), nor do you need to do expensive cross-validation to evaluate your model (but can use the Bayesian concept of maximum likelihood II)."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Excurs: Large-Scale Regression"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One \"problem\" with the classical method of Gaussian Process based inference is the computational complexity of $\\mathcal{O}(n^3)$, where $n$ is the number of training examples. This is caused by matrix inversion, Cholesky factorization, etc. Up to a few thousand points, this is feasible. You will quickly run into memory and runtime problems for very large problems.\n",
      "\n",
      "One way of approaching very large problems is called *Fully Independent Training Components*, which is a low-rank plus diagonal approximation to the exact covariance. The rough idea is to specify a set of $m\\ll n$ *inducing points* and to base all computations on the covariance between training/test and inducing points only, which intuitively corresponds to combining various training points around an inducing point. This reduces the computational complexity to $\\mathcal{O}(nm^2)$, where again $n$ is the number of training points, and $m$ is the number of inducing points. This is quite a significant decrease, in particular if the number of inducing points is much smaller than the number of examples.\n",
      "\n",
      "The optimal way to specify inducing points is to densely and uniformly place them in the input space. However, this might be quickly become infeasible in high dimensions. In this case, a random subset of the training data might be a good idea.\n",
      "\n",
      "In Shogun, the class  <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CFITCInferenceMethod.html\">CFITCInferenceMethod</a> handles inference for regression with the <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianLikelihood.html\">CGaussianLikelihood</a>. Below, we demonstrate its usage on a toy example and compare to exact regression. Note that <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGradientModelSelection.html\">CGradientModelSelection</a> still works as before. We compare the runtime for inference with both GP.\n",
      "\n",
      "First, note that changing the inference method only requires the change of a single line of code"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# for measuring runtime\n",
      "import time\n",
      "\n",
      "# simple regression data\n",
      "X_train, y_train, X_test, y_test = generate_regression_toy_data(n=1000)\n",
      "\n",
      "# bring data into shogun representation (features are 2d-arrays, organised as column vectors)\n",
      "feats_train=RealFeatures(X_train.reshape(1,len(X_train)))\n",
      "feats_test=RealFeatures(X_test.reshape(1,len(X_test)))\n",
      "labels_train=RegressionLabels(y_train)\n",
      "\n",
      "# inducing features (here: a random grid over the input space, try out others)\n",
      "n_inducing=10\n",
      "#X_inducing=linspace(X_train.min(), X_train.max(), n_inducing)\n",
      "X_inducing=rand(X_train.min()+n_inducing)*X_train.max()\n",
      "feats_inducing=RealFeatures(X_inducing.reshape(1,len(X_inducing)))\n",
      "\n",
      "# create FITC inference method and GP instance\n",
      "inf = FITCInferenceMethod(GaussianKernel(10, 1), feats_train, ZeroMean(), labels_train, \\\n",
      "                          GaussianLikelihood(), feats_inducing)\n",
      "gp = GaussianProcessRegression(inf)\n",
      "\n",
      "# and of course, this can be combined with ML2 model-selection\n",
      "grad = GradientEvaluation(gp, feats_train, labels_train, GradientCriterion(), False)\n",
      "grad.set_function(inf)\n",
      "grad_search = GradientModelSelection(grad)\n",
      "\n",
      "# search for best parameters and store them\n",
      "best_combination = grad_search.select_model(True)\n",
      "\n",
      "# apply best parameters to GP, train\n",
      "best_combination.apply_to_machine(gp)\n",
      "\n",
      "# we have to \"cast\" objects to the specific kernel interface we used (soon to be easier)\n",
      "best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()\n",
      "best_scale=inf.get_scale()\n",
      "best_sigma=GaussianLikelihood.obtain_from_generic(inf.get_model()).get_sigma()\n",
      "\n",
      "# for comparison, create exact GP with same parameters\n",
      "# also re-create old GP for fair time measurements\n",
      "# FITC GP\n",
      "start=time.time()\n",
      "inf = FITCInferenceMethod(GaussianKernel(10, best_width), feats_train, ZeroMean(), labels_train, \\\n",
      "                          GaussianLikelihood(best_sigma), feats_inducing)\n",
      "inf.set_scale(best_scale)\n",
      "gp = GaussianProcessRegression(inf)\n",
      "gp.train()\n",
      "means = gp.get_mean_vector(feats_test)\n",
      "variances = gp.get_variance_vector(feats_test)\n",
      "print \"FITC inference took %.2f seconds\" % (time.time()-start)\n",
      "\n",
      "# exact GP\n",
      "start=time.time()\n",
      "inf_exact = ExactInferenceMethod(GaussianKernel(10, best_width), feats_train, ZeroMean(), labels_train, \\\n",
      "                                 GaussianLikelihood(best_sigma))\n",
      "inf_exact.set_scale(best_scale)\n",
      "gp_exact = GaussianProcessRegression(inf_exact)\n",
      "gp_exact.train()\n",
      "means_exact = gp_exact.get_mean_vector(feats_test)\n",
      "variances_exact = gp_exact.get_variance_vector(feats_test)\n",
      "print \"Exact inference took %.2f seconds\" % (time.time()-start)\n",
      "\n",
      "# comparison plot FITC and exact inference, plot 95% confidence of both predictive distributions\n",
      "figure(figsize=(18,5))\n",
      "plot(X_test, y_test, color=\"black\", linewidth=3)\n",
      "plot(X_test, means, 'r--', linewidth=3)\n",
      "plot(X_test, means_exact, 'b--', linewidth=3)\n",
      "plot(X_train, y_train, 'ro')\n",
      "plot(X_inducing, zeros(len(X_inducing)), 'g*', markersize=15)\n",
      "\n",
      "# tube plot of 95% confidence\n",
      "error=1.96*sqrt(variances)\n",
      "plot(X_test,means-error, color='red', alpha=0.3, linewidth=3)\n",
      "fill_between(X_test,means-error,means+error,color='red', alpha=0.3)\n",
      "error_exact=1.96*sqrt(variances_exact)\n",
      "plot(X_test,means_exact-error_exact, color='blue', alpha=0.3, linewidth=3)\n",
      "fill_between(X_test,means_exact-error_exact,means_exact+error_exact,color='blue', alpha=0.3)\n",
      "\n",
      "# plot upper confidence lines later due to legend\n",
      "plot(X_test,means+error, color='red', alpha=0.3, linewidth=3)\n",
      "plot(X_test,means_exact+error_exact, color='blue', alpha=0.3, linewidth=3)\n",
      "\n",
      "legend([\"True\", \"FITC prediction\", \"Exact prediction\", \"Data\", \"Inducing points\", \"95% FITC\", \"95% Exact\"])\n",
      "_=title(\"Comparison FITC and Exact Regression\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note how the FITC inference method is much faster than exact inference for this number of data points. However, based on the placement of the inducing points, the results might be unacurate. As mentioned, a uniform spacing over the space would be optimal."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Soon to Come"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      " * Overview of all base classes for GPs\n",
      " * Excurs: Automatic relevance determination\n",
      " * Real life examples for regression and classification, including classification of sequences\n",
      " * Fully Bayesian treatment of \\theta"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}